Hello.
Welcome to the course Text Mining and
Analytics.
My name is ChengXiang Zhai.
I have a nickname, Cheng.
I am a professor in the Department of
Computer Science at the University of
Illinois at Urbana-Champaign.
This course is a part of
a data mining specialization
offered by the University of
Illinois at Urbana-Champaign.
In addition to this course,
there are four other courses offered by
Professor Jiawei Han,
Professor John Hart and me, followed by
a capstone project course that
all of us will teach together.
This course is particularly related to
another course in the specialization,
namely Text Retrieval and Search Engines,
in that both courses are about text data.
In contrast, pattern discovery and
cluster analysis are about
algorithms more applicable to
all kinds of data in general.
The visualization course is also
relatively general in that the techniques
can be applied to all kinds of data.
This course addresses a pressing need for
harnessing big text data.
Text data has been growing
dramatically in recent years,
mostly because of advances in
technologies deployed on the web
that enable people to
quickly generate text data.
So, I listed some of
the examples on this slide
that can show a variety of text
data that are available today.
For example, if you think about
the data on the internet, on the web,
every day we are seeing many
web pages being created.
Blogs are another kind
of new text data that
are being generated quickly by people.
Anyone can write a blog
article on the web.
News articles, of course, have always been
a main kind of text data that
is being generated every day.
Emails are yet another kind of text data.
And literature also represents
a large portion of text data.
It is especially important
because of the high quality
of the data.
That is,
we encode our knowledge about the world
using text data represented by
all the literature articles.
There is a vast amount of knowledge in
all the text
data in these literature articles.
Twitter is another representative source of
text data, representing social media.
Of course, there are forums as well.
People are generating tweets very quickly.
Indeed, as we are speaking, perhaps many
people have already written many tweets.
So, as you can see there
are all kinds of text data
that are being generated very quickly.
Now these text data present
some challenges for people.
It's very hard for anyone to
digest all the text data quickly.
In particular, it's impossible for
scientists to read all of the literature,
for example, or for
anyone to read all the tweets.
So there's a need for tools to help
people digest text data more efficiently.
There is also another
interesting opportunity
provided by such big text data, and
that is it's possible to leverage
the amount of text data to
discover interesting patterns to
turn text data into actionable knowledge
that can be useful for decision making.
So for example,
product managers may be interested
in knowing the feedback of
customers about their products,
knowing how well their
products are being received as
compared with the products of competitors.
This can be a good opportunity for
leveraging text data as we have seen
a lot of reviews of products on the web.
So if we can develop and master text
mining techniques to tap into such
a [INAUDIBLE] to extract the knowledge and
opinions of people about these products,
then we can help these product managers
to gain business intelligence or,
essentially, feedback
from their customers.
In scientific research, for example,
scientists are interested in knowing
the trends of research topics, knowing
about what related fields have discovered.
This problem is especially important
in biology research.
Different communities tend to
use different terminologies, yet
they're studying very similar problems.
So how can we integrate the knowledge
that is covered in different communities
to help study a particular problem?
It's very important, and
it can speed up scientific discovery.
So there are many such examples
where we can leverage the text data
to discover usable knowledge
to optimize our decisions.
The main techniques for
harnessing big text data are text
retrieval and text mining.
So these are two very much
related technologies. Yet,
they have somewhat different purposes.
These two kinds of techniques are covered
in two courses in this specialization.
So, Text Retrieval and Search
Engines covers text retrieval,
and this is necessary to
turn big text data into
a much smaller but more relevant set of text
data, which is often the data that
we need to handle a particular problem or
to optimize a particular decision.
This course covers text mining, which
is the second step in this pipeline
that can be used to further process
the small amount of relevant data
to extract the knowledge or to help
people digest the text data easily.
So the two courses are clearly related.
In fact,
some of the techniques are shared by
both text retrieval and text mining.
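
To make the two-step pipeline described above a little more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not a method taught in either course: the tiny document collection, the stopword list, and the function names `retrieve` and `mine_top_terms` are all hypothetical. Retrieval is approximated by simple keyword overlap, and mining by counting frequent terms.

```python
from collections import Counter

# Hypothetical toy collection standing in for "big text data".
DOCS = [
    "the battery life of this phone is great",
    "the phone screen is great but the battery drains fast",
    "my cat sleeps all day",
    "stock prices fell sharply today",
]

# Hypothetical stopword list, just large enough for this toy example.
STOPWORDS = {"the", "of", "this", "is", "but", "my", "all", "a"}


def retrieve(query, docs):
    """Step 1 (text retrieval): keep documents that share a term with the query."""
    query_terms = set(query.lower().split())
    return [d for d in docs if query_terms & set(d.lower().split())]


def mine_top_terms(docs, k=3):
    """Step 2 (text mining): find the k most frequent non-stopword terms."""
    counts = Counter(
        w for d in docs for w in d.lower().split() if w not in STOPWORDS
    )
    return counts.most_common(k)


# Big text data -> smaller, relevant subset -> simple extracted "knowledge".
relevant = retrieve("phone battery", DOCS)
top_terms = mine_top_terms(relevant)
print(relevant)
print(top_terms)
```

In this sketch, the retrieval step reduces four documents to the two that mention the phone or its battery, and the mining step then summarizes only that smaller, relevant subset.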
If you have already taken the text
retrieval course, then you might see
some of the content being repeated
in this text mining course, although
we'll be talking about the techniques
from a very different perspective.
If you have not taken
the text retrieval course,
it's also fine because this
course is self-contained and
you can certainly understand all of
the materials without a problem.
Of course, you might find it
beneficial to take both courses and
that will give you a very complete set
of skills to handle big text data.

